Correcting ‘Wrong-Column’ Errors in Text Databases
Abstract
We present a novel data-driven approach for detecting and correcting errors in text databases. We focus on information that was accidentally entered in an incorrect column. Unlike machine-learning approaches to data cleaning that assume the database cells to contain atomic or numeric content, our method takes into account substrings of textual cells, and treats error detection and correction as a text categorisation task. Errors are detected at points where the classifier disagrees with the data; corrections are the suggestions put forward by the classifier. We demonstrate that the method is suited for high-recall detection of errors in freetext columns of a zoological database, with a high correction accuracy as well.
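The detect-by-disagreement idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name a classifier, so a small multinomial naive Bayes model over whitespace tokens is swapped in; the column names (`habitat`, `remarks`), the toy rows, and all function names are hypothetical; and whole cells are classified rather than the substrings the paper works with. Each cell is scored against every column with its own tokens held out of its own column's counts, a cell is flagged when the predicted column differs from the column it was entered in, and the predicted column is the suggested correction.

```python
import math
from collections import Counter

# Hypothetical toy data: row 3 has its two cells entered in swapped columns.
ROWS = [
    {"habitat": "found in rain forest near river",
     "remarks": "specimen damaged tail missing"},
    {"habitat": "rain forest under rotting log",
     "remarks": "tail missing label faded"},
    {"habitat": "dry forest near river bank",
     "remarks": "specimen label faded"},
    {"habitat": "specimen damaged label missing",
     "remarks": "found near river in forest"},
]


def tokenize(text):
    """Lowercase whitespace tokenisation (an assumed, deliberately simple choice)."""
    return text.lower().split()


def column_counts(rows):
    """Count how often each token occurs in each column."""
    counts = {}
    for row in rows:
        for col, text in row.items():
            counts.setdefault(col, Counter()).update(tokenize(text))
    return counts


def predict_column(counts, tokens, own_col):
    """Return the column whose token distribution best explains `tokens`.

    Uses add-one smoothing and a uniform prior over columns. The cell's own
    tokens are subtracted from its own column so it cannot vote for itself.
    """
    vocab_size = len({tok for c in counts.values() for tok in c})
    held_out = Counter(tokens)
    best_col, best_score = None, -math.inf
    for col in sorted(counts):
        c = counts[col] - held_out if col == own_col else counts[col]
        total = sum(c.values())
        score = sum(math.log((c[tok] + 1) / (total + vocab_size))
                    for tok in tokens)
        if score > best_score:
            best_col, best_score = col, score
    return best_col


def suggest_corrections(rows):
    """List (row index, current column, suggested column) for disagreeing cells."""
    counts = column_counts(rows)
    suggestions = []
    for i, row in enumerate(rows):
        for col, text in row.items():
            tokens = tokenize(text)
            if not tokens:
                continue
            predicted = predict_column(counts, tokens, col)
            if predicted != col:
                suggestions.append((i, col, predicted))
    return suggestions
```

Calling `suggest_corrections(ROWS)` returns the cells where the classifier disagrees with the data, each paired with the column the classifier would move it to. On a data set this small the scores are close, so a real application would need far more rows and a held-out evaluation before trusting the suggestions.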
Related resources
Spotting The 'Odd-One-Out': Data-Driven Error Detection And Correction In Textual Databases
We present two methods for semiautomatic detection and correction of errors in textual databases. The first method (horizontal correction) aims at correcting inconsistent values within a database record, while the second (vertical correction) focuses on values which were entered in the wrong column. Both methods are data-driven and language-independent. We utilise supervised machine learning, b...
Conjectural Correction of Some Difficult Phrases in Sharh-e Shathiyat
The current article reviews and corrects some difficult and obscure words in Description of Shathyyāt, written by Roozbehān Baqali. Like other mystic texts, this book uses a technical writing style, which makes it one of the complicated mystic passages. Some complexities of this book, however, are assumed to originate in errors and inaccuracies in the text. A Compa...
Can Confidence Scores Post-editing Speech Recog
When dictating with speech recognition, most of the user’s time is spent correcting errors. To decrease the burden we propose new editor functions specifically to speed up the correction process. The idea is to use a recognition confidence measure to predict which words are likely to be in error, to display that information to the user by highlighting suspect words, and to provide a command to ...
Removing Geometric Distortion from Text Documents Using Geometric Information of Text Lines
Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts a document, recognition of words from the document image using OCR is prone to errors. In this paper we propose a novel approach to significantly reduce geometric distortion in document images. In this method we first extract document lines from do...
Frequency, Type and Causes of Medication Errors in Pediatric Wards of Hospitals in Yazd, the Central of Iran
Background: Medication errors are among the most common medical errors and are used as an indicator to assess patient safety in hospitals. The aim of this study was therefore to investigate the frequency, type and causes of medication errors in the children's wards of hospitals in Yazd, Iran. Materials and Methods: This descriptive-analytical study was conducted over 6 months, from Jan to Jun 2015....